If you’re reading this for the first time, I recommend reading the Objective and looking at the resulting figures at the end of the document to get an idea of what I’m doing before digging too deeply into the methods. That said, DO dig into the methods afterwards.



Objective

Cluster high-flow event characteristics and antecedent watershed conditions to evaluate how these factors converge as flux regimes (clusters) to produce variability in:
1. Event NO3 yields
2. Event SRP yields
3. Event turbidity yields
4. Event NO3:SRP yield ratios


Select variables to keep

Per K Underwood: if two variables are strongly correlated (negatively or positively) they can effectively “double-weight” a particular factor important in driving clustering; thus, keep just one of the variables to serve as a proxy for that factor
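
As a minimal sketch of this screening step (not the code used for this analysis), the snippet below computes pairwise Pearson correlations and flags any pair whose absolute correlation exceeds the 70% cutoff, so that negative and positive correlations are treated alike. The function names and the toy columns `var_a`, `var_b`, `var_c` are hypothetical and only there for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, cutoff=0.70):
    """Return (name1, name2, r) for every pair of columns with |r| above the cutoff."""
    flagged = []
    for (name1, x), (name2, y) in combinations(columns.items(), 2):
        r = pearson(x, y)
        if abs(r) > cutoff:  # absolute value: strong negative correlations count too
            flagged.append((name1, name2, r))
    return flagged


# Hypothetical toy data, not values from this study
toy = {
    "var_a": [1, 2, 3, 4, 5],
    "var_b": [10, 8, 6, 4, 2],   # perfectly negatively correlated with var_a
    "var_c": [-1, 1, 0, 1, -1],  # uncorrelated with both
}
flagged = correlated_pairs(toy)
```

Here only the `var_a`/`var_b` pair is flagged; which member of a flagged pair to keep is still a judgment call, as the decisions below show.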

Decisions for eliminating variables w/ correlations >70%

These were used for the 2020-12-10 run:

  • These decisions were tough to make and need review
    • rain_event_total_mm and API_4d are highly correlated (88%), but represent different things; I’m going to leave both for now, but could try versions w/ one or the other to see if it matters

    • q_event_max and q_mm are highly correlated (83.8%); let’s drop q_event_max for now (reason still TBD)
    • SoilTemp_pre_wet_15cm and VWC_pre_wet_15cm are correlated (72.3%), but they’re pretty different and close enough to 70% correlation cutoff; will leave both for now and could try versions w/ one or the other to see if it matters

  • I feel OK about these decisions, but they should be reviewed as well
  • If the 1-d and 4-d values for a variable are highly correlated, use the 4-d value
    • gw_1d_allWells and gw_4d_allWells are highly correlated (99.2%); remove gw_1d_allWells
  • Trying to find a VWC variable that correlates well with GW level so that I can remove GW level vars (no GW data in 2017)
    • When dropping all GW variables, n obs increases from 45 to 51
    • gw_4d_allWells and VWC_pre_wet_30cm are the most highly correlated (94.5%); VWC_pre_wet_15cm is nearly the same (94.2%), except slightly less linear at higher values
    • drop gw_4d_allWells and use VWC as a proxy for GW level
    • VWC_pre_wet_15cm and VWC_pre_wet_30cm highly correlated (94.7%)
    • I want to keep all 15cm values for soil vars including VWC, so for now we’ll drop VWC_pre_wet_30cm to avoid double-weighting VWC
  • MET variables
    • airT_1d and airT_4d are highly correlated (90.1%); removing airT_1d
    • airT_4d and dewPoint_4d are highly correlated (96.6%), as is dewPoint_1d; removing both dewPoints
    • airT_4d and SoilTemp_pre_wet_15cm are highly correlated (92.9%); drop airT_4d b/c we still have diff_airT_soilT
    • solarRad_1d and solarRad_4d are highly correlated (74.7%); drop solarRad_1d per rule above
  • Q
    • q_event_delta & q_event_max are highly correlated (96.3%); q_event_max is more normally distributed, so let’s keep it rather than q_event_delta
    • q_1d and q_4d are highly correlated (86.4%), so sticking with rule above will keep the q_4d
    • Drop q_event_dQRate_cmsPerHr b/c it’s confusing and hopefully q_event_delta or rain intensity will capture this
  • Redox
    • If redox variables prove not to be important or they are highly correlated with another variable, we can remove them and increase n obs by at least 7
  • Rain
    • Drop all the rain_Xd vars, b/c API_4d should cover this, though would be interesting to test how many days pre-event (e.g., 4 days for API) matters
    • rain_int_mmPERmin_mean and rain_int_mmPERmin_max are correlated (74.4%); drop _max so we keep the mean intensity of the rain event
  • Stream
    • turb_1d proved not to be useful in driving clusters in the SOM, so I removed it
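
The one mechanical rule in the list above (if the 1-d and 4-d versions of a variable are highly correlated, keep the 4-d one) could be sketched as follows. This is an illustration only, not the code used here: the function name is hypothetical, and it assumes the two versions differ only by `_1d` vs `_4d` in the name. The pairs and correlations are the ones quoted above, written as fractions.

```python
def drop_1d_when_paired(pairs, cutoff=0.70):
    """Given (name_a, name_b, r) tuples, list the 1-d variables to drop.

    A variable is dropped only when its partner is the same name with
    `_1d` swapped for `_4d` and |r| exceeds the cutoff.
    """
    dropped = []
    for name_a, name_b, r in pairs:
        if abs(r) <= cutoff:
            continue
        # Check both orderings of the pair
        for short, long in ((name_a, name_b), (name_b, name_a)):
            if "_1d" in short and short.replace("_1d", "_4d") == long:
                if short not in dropped:
                    dropped.append(short)
    return dropped


# Correlated pairs quoted in the list above
pairs = [
    ("gw_1d_allWells", "gw_4d_allWells", 0.992),
    ("airT_1d", "airT_4d", 0.901),
    ("airT_4d", "dewPoint_4d", 0.966),  # not a 1-d/4-d pair, so the rule ignores it
    ("solarRad_1d", "solarRad_4d", 0.747),
    ("q_1d", "q_4d", 0.864),
]
to_drop = drop_1d_when_paired(pairs)
```

The remaining decisions (e.g., dropping the dew point or GW variables) rest on judgment rather than a naming rule, so they are not automated in this sketch.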

Look at correlations again after dropping variables

Self-organizing map (SOM)

Prepare data & set up grid/lattice dimensions

We’re only using complete observations/rows (no NAs in any columns).
According to the heuristic rule from Vesanto 2000, the number of grid elements (grid size, or nodes) = 5 * sqrt(n), where n is the number of complete observations.
To determine the shape of the grid (ratio of columns to rows), we use the ratio of the first two eigenvalues of the input data set, as recommended by Park et al. 2006.

## [1] No. of complete observations: 49 out of 76 observations
## [1] No. of Vesanto nodes: 35
## [1] Ratio of columns to rows: 1.8
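
The arithmetic behind these numbers can be sketched as below. This is a rough illustration, not the code used here: the function names are hypothetical, the eigenvalue ratio (1.8) is taken as given from the output above rather than computed, and turning a node count plus a ratio into rows and columns by simple rounding is my assumption.

```python
import math


def complete_rows(rows):
    """Keep only rows with no missing values (None or NaN)."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    return [row for row in rows if not any(is_missing(v) for v in row)]


def vesanto_nodes(n_obs):
    """Heuristic grid size: 5 * sqrt(n), rounded to a whole number of nodes."""
    return round(5 * math.sqrt(n_obs))


def grid_shape(n_nodes, col_row_ratio):
    """Rows and columns such that rows * cols ~ n_nodes and cols / rows ~ ratio."""
    rows = math.sqrt(n_nodes / col_row_ratio)
    cols = col_row_ratio * rows
    return round(rows), round(cols)


nodes = vesanto_nodes(49)       # 49 complete observations, as reported above
shape = grid_shape(nodes, 1.8)  # column:row ratio reported above
```

With n = 49 this gives 35 nodes, matching the output above. Rounding the shape gives 4 rows by 8 columns (32 nodes), which does not hit 35 exactly; that is one reason to try a suite of nearby grid configurations rather than a single one.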



Run SOM for a suite of grid/lattice configurations, # of nodes, and # of clusters

Code courtesy of Kristen Underwood (hidden)

## [1] Topology: hexagonal
## [1] Data normalization method used: L2norm
## [1] Weighting method used: noPCA
## [1] No. of iterations: 1500
## [1] alphaCrs used: 0.05
## [1] alphaFin used: 1500



Choose the best SOM run based on non-parametric F-stat and quantization error

We want to maximize npF (the non-parametric F-statistic: the ratio of between-cluster to within-cluster variance) and minimize QE (quantization error: the mean distance between each data vector and its best-matching unit)
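
To make the two metrics concrete, here is a minimal sketch. The exact npF formula used by the hidden SOM code is not shown in this document, so the distance-based pseudo-F below (between-cluster vs within-cluster sums of squares, each divided by its degrees of freedom) is an assumption standing in for it. The function names and toy data are hypothetical.

```python
import math


def euclid(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def quantization_error(data, codebook):
    """Mean distance from each data vector to its best-matching unit (nearest node)."""
    return sum(min(euclid(x, w) for w in codebook) for x in data) / len(data)


def pseudo_f(data, labels):
    """Distance-based pseudo-F: (SS_between / (a - 1)) / (SS_within / (N - a))."""
    n = len(data)
    groups = sorted(set(labels))
    a = len(groups)

    def sum_sq(indices):
        # Sum of squared pairwise distances divided by the group size
        total = 0.0
        for pos, i in enumerate(indices):
            for j in indices[pos + 1:]:
                total += euclid(data[i], data[j]) ** 2
        return total / len(indices)

    ss_total = sum_sq(list(range(n)))
    ss_within = sum(sum_sq([i for i in range(n) if labels[i] == g]) for g in groups)
    ss_between = ss_total - ss_within
    return (ss_between / (a - 1)) / (ss_within / (n - a))


# Hypothetical toy data: two tight, well-separated clusters on a line
data = [[0.0], [2.0], [10.0], [12.0]]
labels = [1, 1, 2, 2]
codebook = [[1.0], [11.0]]  # one SOM node per cluster, for illustration only

f_stat = pseudo_f(data, labels)
qe = quantization_error(data, codebook)
```

Tight, well-separated clusters push the pseudo-F up, while a codebook that sits close to the data pushes QE down, which is why the two are used together to rank runs.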
Here are the top 33% of runs based on npF

Choose the best run for each number of clusters among the high-performing (top 33%) runs
## [1] These were the top runs for each high-performing (top 33%) cluster #:
Run  rows  cols  Nodes  Clusters     npF     QE
  9     5     7     35         3  25.050  0.105
 39     4    10     40         5  23.620  0.083
128     6     8     48         6  23.145  0.099
 23     4    10     40         4  22.811  0.124
## [1] The script chose:

Run  rows  cols  Nodes  Clusters    npF     QE
  9     5     7     35         3  25.05  0.105
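
The selection logic can be sketched roughly as follows. The hidden script’s actual rule is not shown here, so ranking by npF with QE as a tie-breaker is my assumption; it does reproduce the run reported above for the rows in the table. The function names are hypothetical, and the rows are copied from the table above.

```python
import math

# Rows copied from the table of top runs above
runs = [
    {"run": 9, "rows": 5, "cols": 7, "nodes": 35, "clusters": 3, "npF": 25.050, "QE": 0.105},
    {"run": 39, "rows": 4, "cols": 10, "nodes": 40, "clusters": 5, "npF": 23.620, "QE": 0.083},
    {"run": 128, "rows": 6, "cols": 8, "nodes": 48, "clusters": 6, "npF": 23.145, "QE": 0.099},
    {"run": 23, "rows": 4, "cols": 10, "nodes": 40, "clusters": 4, "npF": 22.811, "QE": 0.124},
]


def top_fraction(all_runs, frac=0.33):
    """Keep the top fraction of runs ranked by npF (at least one run)."""
    keep = max(1, math.ceil(frac * len(all_runs)))
    return sorted(all_runs, key=lambda r: r["npF"], reverse=True)[:keep]


def best_per_cluster_count(candidates):
    """For each number of clusters, keep the run with the highest npF (lower QE breaks ties)."""
    best = {}
    for r in candidates:
        current = best.get(r["clusters"])
        if current is None or (r["npF"], -r["QE"]) > (current["npF"], -current["QE"]):
            best[r["clusters"]] = r
    return sorted(best.values(), key=lambda r: r["npF"], reverse=True)


# The table already holds the post-filter runs, so rank them directly
ranked = best_per_cluster_count(runs)
best, second = ranked[0], ranked[1]
```

In the full workflow, `top_fraction` would be applied to the complete suite of runs first, followed by `best_per_cluster_count`. Note that npF and QE can disagree (run 39 has the lowest QE but not the highest npF), so the choice of ranking rule matters.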


To examine this run in greater detail (e.g., component planes), see the ‘X_SOMplots_site_ … .pdf’


Examine boxplots of independent variables by cluster





Examine how antecedent and event conditions converge to influence N & P flux regimes:













How do our results differ if we choose the 2nd best SOM run?

## [1] The 2nd best run was:

Run  rows  cols  Nodes  Clusters    npF     QE
 39     4    10     40         5  23.62  0.083


To examine this run in greater detail (e.g., component planes), see the ‘X_SOMplots_site_ … .pdf’